Introduction

Column

Preamble

Here we present 16S rRNA gene amplicon data for the hypoxic event that took place in the Bay of Almirante, Bocas del Toro, Panama, during the fall of 2017. This document is interactive. You can sort and scroll through most of the tables and some figures are zoomable. In the upper right hand corner of the front page is a Source Code button. Use this to copy the source code of this document. Our goal in the future is to incorporate more interactive figures and more friendly ways of exploring the data. Let’s proceed.

Important Definitions & Abbreviations

  • Amplicon Sequence Variant (ASV): Exact sequence variant—analogous to an OTU—but with single nucleotide resolution. For all you ecologists out there an ASV, for our purposes, is akin to a species.
  • Differentially abundant (DA) feature: Taxa, ASV, etc. that is disproportionately abundant in a group of samples and statistically different than other groups.

The goals of this study were to:

Column

Workflow overview

Part I: Data Preparation

In this first part we go through the steps of defining sample groups, creating phyloseq objects, removing unwanted samples, and removing contaminants. Various parts of this section can easily be modified to perform different analyses. For example, if you were only interested in a specific taxa or group of samples, you could change the code here to create new phyloseq objects.

Part II: Taxonomic Diversity

In the second part, we assess taxonomic composition of these samples. We primarily look at Class-level taxonomic composition to get a birds-eye-view of the system.

Part III: Alpha Diversity

In the third part, we look alpha diversity. Here we use several indices and associated statistics to determine a) the diversity of different environments and b) whether that diversity is statistically different.

Part IV: Beta Diversity

Next we look at beta diversity. DETAILS

Part V:Differentially Abundant ASVs PENDING

We wanted to understand how ASVs partitioned across environment. We also wanted to assess the specificity of each ASV to determine habitat preference. To our knowledge there is no quantitative way to do this. The only attempt we are aware of was MetaMetaDB but it is based on a 454 database and no longer seems to be in active development. So we used an approach based on the work of Sullam et. al., first identifying differentially abundant ASVs, then searching for closest database hits, and finally using phylogenetic analysis and top hit metadata (isolation source, natural host, environment) to infer habitat preference.

Part VI: Synthesis PENDING

In this section we pull together all the results and try to make sense of the microbiomes from these environments. How are ASVs partitioning across environment? How similar are these ASVs to sequences from other studies? What can these patterns tell us about habitat specificity?

I. Physiochemistry

Depth Profiles: Dissolved Oxygen Depth Profiles


'data.frame':   244 obs. of  23 variables:
 $ Date          : chr  "9/21/17" "9/21/17" "9/21/17" "9/21/17" ...
 $ Time          : chr  "10:53:59" "10:54:04" "10:54:09" "10:54:14" ...
 $ site          : chr  "Hospital_Point" "Hospital_Point" "Hospital_Point" "Hospital_Point" ...
 $ temp          : num  30.6 30.7 30.7 30.7 30.7 ...
 $ cond          : num  58312 58368 58380 58401 58445 ...
 $ cond2         : num  52687 52612 52613 52671 52721 ...
 $ sal           : num  34.5 34.5 34.5 34.5 34.6 ...
 $ cond3         : num  52058 51968 51968 52029 52080 ...
 $ tds           : int  34246 34198 34199 34236 34269 34309 34334 34339 34424 34535 ...
 $ ODOsat        : num  78.4 80.9 81.9 82.7 83.5 83.7 83.6 83.5 78.3 76.7 ...
 $ DO            : num  4.86 5 5.06 5.11 5.16 5.17 5.17 5.16 4.83 4.73 ...
 $ pH            : num  7.9 7.92 7.94 7.96 7.98 7.99 8 8 8 7.99 ...
 $ Turbidity     : num  7.9 3.21 0.63 0.6 0.64 0.65 0.71 0.69 0.67 0.64 ...
 $ TSS           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Chlorophyll   : num  0.6 0.42 0.55 0.56 0.65 0.74 0.74 0.71 0.79 0.9 ...
 $ ChlorophyllugL: num  2.98 2.11 2.75 2.8 3.24 3.69 3.72 3.55 3.94 4.47 ...
 $ BGA_RFU       : num  1.35 1.6 1.69 1.75 1.92 2.01 2.06 2.11 2.26 2.36 ...
 $ BGAugL        : num  3.6 4.28 4.52 4.71 5.16 5.4 5.54 5.69 6.09 6.36 ...
 $ fDOMRFU       : num  0.15 0.21 0.27 0.29 0.26 0.26 0.27 0.27 0.28 0.3 ...
 $ fDOMQSU       : num  0.57 0.75 0.96 1.02 0.94 0.92 0.95 0.97 0.99 1.05 ...
 $ pressure      : num  1.8 1.82 1.84 3.55 3.62 ...
 $ depth_m       : num  1.23 1.24 1.25 2.44 2.49 ...
 $ site_o        : Factor w/ 8 levels "Crawl_Key","Cayo_Roldon",..: 3 3 3 3 3 3 3 3 3 3 ...

II. 16S rRNA Data Prep

1. Define sample groups: To prepare for downstream analysis, we first add group designations.

  1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (combo_pipeline.rdata) contains the ASV-by-sample table and the ASV taxonomy table.

  2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

  1. And finally we define a sample data frame that holds the different groups we extracted from the sample names. On the right are the sample names and the different group categories.

Sample abbreviations are as follows:

X indicates the Environment

  • W = Water
  • S = Sediment
  • C = Coral
  • M = Mat

YY indicates the site name

  • CC = Coral Caye
  • CR = Cayo Roldan

Z indicates the replicate number

So MCR5 is a Mat sample from Cayo Roldan, replicate 5.


SamName ENV SITE
CCC1 Coral Coral Caye
CCC2 Coral Coral Caye
CCC3 Coral Coral Caye
CCC4 Coral Coral Caye
CCC5 Coral Coral Caye
CCC6 Coral Coral Caye
CCR1 Coral Cayo Roldan
CCR2 Coral Cayo Roldan
CCR3 Coral Cayo Roldan
CCR4 Coral Cayo Roldan
CCR5 Coral Cayo Roldan
CCR6 Coral Cayo Roldan
MCR1 Mat Cayo Roldan
MCR2 Mat Cayo Roldan
MCR3 Mat Cayo Roldan
MCR5 Mat Cayo Roldan
SCC1 Sediment Coral Caye
SCC2 Sediment Coral Caye
SCC3 Sediment Coral Caye
SCR1 Sediment Cayo Roldan
SCR2 Sediment Cayo Roldan
SCR3 Sediment Cayo Roldan
WCC0 Water Coral Caye
WCC1 Water Coral Caye
WCC2 Water Coral Caye
WCC3 Water Coral Caye
WCR0 Water Cayo Roldan
WCR1 Water Cayo Roldan
WCR2 Water Cayo Roldan
WCR3 Water Cayo Roldan

2. Create & modify a phyloseq object: a phyloseq (ps) object serves as the basis for all downstream analyses.

  1. Create a phyloseq (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow.

  2. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"
  1. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.
  1. Export sequence and taxonomy tables for the unadulterated phyloseq object to the DATA_TABLES directory for later use. We will use the prefix full to indicate that these are raw data products.

Optional

  1. You can also remove any samples at this point. To do this you must select the samples you wish to keep. If you want to change the group of samples, modify the script accordingly. For now this function is off meaning you must modify the eval flag in the code chunk and list sample names for this to run.

  2. If you remove samples you probably lost some ASVs. So you need to get rid of any ASVs that have a total of 0 reads.


This is our raw phyloseq object.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6182 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6182 taxa by 8 taxonomic ranks ]

The phyloseq object contains:

  • 400887 total reads
  • across 6182 ASVs
  • and 30 samples

3. Remove contaminants: In this case mitochondria, chloroplasts, & Eukaryotic ASVs

  1. Remove any Mitochondria hits. Remember the original dataset contained 6182 ASVs.

To accomplish this we do the following:

  • Subset the taxa to generate a ps object of just Mitochondria,
  • Select the ASV column only, turn it into a factor, and use this to remove Mitochondria from the ps object.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6155 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6155 taxa by 8 taxonomic ranks ]

Looks like this removed 27 Mitochondria ASVs. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs.

  1. Remove any Chloroplast hits.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5955 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5955 taxa by 8 taxonomic ranks ]

The code removed an additional 200 ASVs classified as Chloroplast.

  1. Remove any Eukaryotic hits.
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5949 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5949 taxa by 8 taxonomic ranks ]

Finally, 6 Eukaryota ASVs were eliminated from the ps object.


Note the command used for this operation (subset_taxa within phyloseq) removes anything that is NA for the taxonomic level of interest or above. That’s fine if you are working at the Phylum level—for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to run this command using Family != "Mitochondria" you will not only get rid all ASVs classified as Mitochondria but everything else that has NA for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the command as is will remove any ASVs not classified at Family level or above. And it is not well documented that the command is doing this—I had to go digging through the files to figure out what was happening. So lets see if we can get rid of just Mitochondria ASVs.


Our final working phyloseq object (ps_slv_work_filt) contains

  • 380962 total reads
  • and 5949 ASVs.

The mean number of reads per sample is 12698. The minimum number of reads per sample is 170 while the maximum is 35444 reads.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5949 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5949 taxa by 8 taxonomic ranks ]

We will again export the sequence and taxonomy tables to the DATA_TABLES directory this time with the file prefix trim to indicate that contaminants have been removed from these data.

4. Sample summary: add total reads and total ASVs to sample summary table


At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis. TODO


One last thing to do is to create a merged phyloseq object grouped by environment. This will come in handy later for some analyses. Basically we collapsed all samples from the same environment type together.

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5949 taxa and 4 samples ]
sample_data() Sample Data:       [ 4 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5949 taxa by 8 taxonomic ranks ]
[1] "Coral"    "Mat"      "Sediment" "Water"   

Great, still 5949 ASVs but now only 4 “samples” corresponding to the unique environment types.

5. Conclusion of Data Preparation Section

ps_slv

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 6182 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 6182 taxa by 8 taxonomic ranks ]

This phyloseq object contains 6182 ASVs, 400887 total reads, and 30 samples.

ps_slv_work_filt

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5949 taxa and 30 samples ]
sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5949 taxa by 8 taxonomic ranks ]

This phyloseq object contains 5949 ASVs, 380962 total reads, and 30 samples after removing Mitochondria, Chloroplast, and Eukaryotes.

mergedGP

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 5949 taxa and 4 samples ]
sample_data() Sample Data:       [ 4 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 5949 taxa by 8 taxonomic ranks ]

This phyloseq object contains 5949 ASVs, 380962 total reads, and 30 samples after collapsing by Environment (ENV).


At this point in the workflow there are the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created.

III. Taxonomic Diversity

Overview

Much of the subsequent taxonomic analyses will be at the Class & Family levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy.

Color blindness & graphics

Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong’s scheme described in this paper. Wong’s color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency—roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not?

This scheme is conservative—there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed 12 and 15 color palettes, but be careful—figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display.


We created two different color palettes—one for taxa and one for environment—using the same 9 colors.

Total reads & ASVs by Taxa: What are the dominant taxa in this system? Here we take a Class-level look at overall diversity.


Most reads by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Oxyphotobacteria

Most ASVs by taxa: Alphaproteobacteria, Bacteroidia, Clostridia, Deltaproteobacteria, Gammaproteobacteria, Planctomycetacia

Relative read abundance: In this section we calculate relative percent read abundance for the combined dataset. This allows us to collapse rare taxa into an Other category.


Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples—a box-and-whisker plot as well as separate bar plots.

I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the Class level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code.

Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an Other category. We can define ‘Other’ however we like so lets take a look at the overall relative abundance of each Class.

Inspecting the table it looks like if we choose a cutoff of 0.01744 (1.744%) we get 8 taxa—sounds pretty good. The rest go into the ‘Other’ category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the Other category and draw the graph.

Class abundance across sample type: Here we generate a bar chart to look at taxonomic abundance for each of the main environment types.

Abundance of bacterial taxa across environment

Abundance of bacterial taxa across environment


At a 1.5% abundance cutoff, 91 Classes are grouped into the ‘Other’ category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart:

It took some tweaking to get the bar chart to look just right—so there is a lot of code here—and it could most certainly be better :/. While we’re at it we will also save a copy of the figure so we can tweak it later and make it look pretty.

One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the Other category. We will see what these taxa are in the next panel.

Class abundance across sample type: Here we generate a modified bar chart separated by individual samples.


Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of Other taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including

Phylum mean Sed samples mean All samples
Proteobacteria 0.3913 0.3347
Bacteroidetes 0.1658 0.2798
Planctomycetes 0.1237 0.0376
Acidobacteria 0.0879 0.0186
Latescibacteria 0.0267 0.0095
Chloroflexi 0.0218 0.0045
Gemmatimonadetes 0.0204 0.0045
Spirochaetes 0.0184 0.0160
Thaumarchaeota 0.0158 0.0033
Kiritimatiellaeota 0.0147 0.0041
Calditrichaeota 0.0135 0.0027
Nanoarchaeaeota 0.0121 0.0028
Nitrospirae 0.0105 0.0021
Cyanobacteria 0.0078 0.1072
Euryarchaeota 0.0062 0.0099
Firmicutes 0.0057 0.0910
Crenarchaeota 0.0055 0.0014
Modulibacteria 0.0050 0.0010
Zixibacteria 0.0042 0.0008
BRC1 0.0033 0.0006
NA 0.0030 0.0016
Fibrobacteres 0.0030 0.0006
Actinobacteria 0.0026 0.0078
Hydrogenedentes 0.0025 0.0005
Lentisphaerae 0.0024 0.0007
Omnitrophicaeota 0.0024 0.0005
Fusobacteria 0.0019 0.0063
Marinimicrobia_(SAR406_clade) 0.0017 0.0048
Verrucomicrobia 0.0016 0.0026
Dadabacteria 0.0015 0.0009
Elusimicrobia 0.0013 0.0003
PAUC34f 0.0012 0.0002
Nitrospinae 0.0012 0.0003
Asgardaeota 0.0011 0.0002
Schekmanbacteria 0.0010 0.0002
Entotheonellaeota 0.0009 0.0002
Tenericutes 0.0008 0.0261
AncK6 0.0008 0.0001
Hydrothermarchaeota 0.0006 0.0001
LCP-89 0.0005 0.0001
Aegiribacteria 0.0003 0.0001
Dependentiae 0.0003 0.0001
Cloacimonetes 0.0002 0.0046
Patescibacteria 0.0002 0.0003
TA06 0.0002 0.0000
Deferribacteres 0.0001 0.0007
WS2 0.0001 0.0000
Deinococcus-Thermus 0.0001 0.0001
Margulisbacteria 0.0001 0.0000
Epsilonbacteraeota 0.0001 0.0039
Altiarchaeota 0.0001 0.0000
CK-2C2-2 0.0000 0.0000
WS1 0.0000 0.0000
FCPU426 0.0000 0.0000
Atribacteria NA 0.0000
Synergistetes NA 0.0000
Thermotogae NA 0.0006
WPS-2 NA 0.0001

IV. \(\alpha\)-Diversity

Overview: Armed with a picture of taxonomic composition we can move on to diversity estimates.

\(\alpha\)-diversity estimates Here we use several diversity indices and add the results to the summary table.


Alpha diversity describes the diversity in a sample or site. There are several alpha diversity metrics available in phyloseq: Observed, Chao1, ACE, Shannon, Simpson, InvSimpson, Fisher. Play around to see how different metrics change or confirm these results.

Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this workshop tutorial by Kim Dill-McFarland and Madison Cox.

First we run the diversity estimates, add these data to our summary table, and save a copy of this table.

\(\alpha\)-diversity statistics: Test whether the results are normally distributed. Results of this will help guide the analyses we do next.

Shapiro-Wilk Normality Test for Shannon index.

    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Shannon
W = 0.93, p-value = 0.06
Shapiro-Wilk Normality Test for inverse Simpson index.

    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$InvSimpson
W = 0.71, p-value = 0.000002
Shapiro-Wilk Normality Test for Chao1 richness estimator.

    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Chao1
W = 0.89, p-value = 0.004
Shapiro-Wilk Normality Test for Observed ASV richness estimator.

    Shapiro-Wilk normality test

data:  sample_data(ps_slv_work_filt_div)$Observed
W = 0.89, p-value = 0.004

Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, Chao’s richness estimate, and Observed richness but this approach can be used for any metric.

Ok, since the p-values are significant for the inverse Simpson, Chao richness, and Observed ASV richness we reject the null hypothesis that these data are normally distributed. However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index.

Assessing significance of \(\alpha\)-diversity metrics: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.

Normally distributed metrics

            Df Sum Sq Mean Sq F value  Pr(>F)    
ENV          3   48.2   16.07    46.3 1.4e-10 ***
Residuals   26    9.0    0.35                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Shannon ~ ENV, data = sampledataDF)

$ENV
                  diff       lwr     upr  p adj
Mat-Coral       2.4546  1.521457  3.3878 0.0000
Sediment-Coral  3.1593  2.351115  3.9674 0.0000
Water-Coral     0.7472  0.009466  1.4850 0.0463
Sediment-Mat    0.7046 -0.338705  1.7480 0.2726
Water-Mat      -1.7074 -2.697228 -0.7176 0.0004
Water-Sediment -2.4121 -3.284983 -1.5391 0.0000

Non-normally distributed metrics

Kruskal-Wallis of inverse Simpson index.


    Kruskal-Wallis rank sum test

data:  InvSimpson by ENV
Kruskal-Wallis chi-squared = 20, df = 3, p-value = 0.0002

Pairwise significance test for inverse Simpson index.


    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$InvSimpson and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0006 0.011 -       
Water    0.792  0.006 0.002   

P value adjustment method: fdr 

Kruskal-Wallis of Chao1 richness estimator.


    Kruskal-Wallis rank sum test

data:  Chao1 by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001
Pairwise significance test for Chao1 richness estimator.

    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$Chao1 and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0003 0.067 -       
Water    0.0001 0.034 0.001   

P value adjustment method: fdr 

Kruskal-Wallis of Observed ASV richness index.


    Kruskal-Wallis rank sum test

data:  Observed by ENV
Kruskal-Wallis chi-squared = 26, df = 3, p-value = 0.00001

Pairwise significance test for Observed ASV richness index.


    Pairwise comparisons using Wilcoxon rank sum test 

data:  sampledataDF$Observed and sampledataDF$ENV 

         Coral  Mat   Sediment
Mat      0.002  -     -       
Sediment 0.0003 0.067 -       
Water    0.0001 0.019 0.001   

P value adjustment method: fdr 

Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test).

Ok, the results of the ANOVA are significant. Here we use the Tukey’s HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different.

Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting.

Now we can look at the results on the inverse Simpson diversity and Chao’s richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance.

For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis.

\(\alpha\)-diversity plots: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.

Figure 2B

Figure 2B


Here we plot the results of each diversity index and include the Observed ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments.

V. \(\beta\)-Diversity

\(\beta\)-diversity analysis using : NMDS coupled with Jensen–Shannon divergence.


Call:
metaMDS(comm = veganifyOTU(physeq), distance = distance) 

global Multidimensional Scaling using monoMDS

Data:     wisconsin(sqrt(veganifyOTU(physeq))) 
Distance: bray 

Dimensions: 2 
Stress:     0.1609 
Stress type 1, weak ties
Two convergent solutions found after 20 tries
Scaling: centring, PC rotation, halfchange scaling 
Species: expanded scores based on 'wisconsin(sqrt(veganifyOTU(physeq)))' 

Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination methods and distance metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen–Shannon divergence. We also save a copy of the figure for later tweaking. To see the full output of the NMDS analysis, remove the results = 'hide' tag from the code chunk.

We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot.

\(\beta\)-diversity NMDS plot :


So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment?

\(\beta\)-diversity statistics :

First we use the adonis function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not.


Call:
adonis(formula = fish.jsd ~ ENV, data = sampledf, permutations = 1000) 

Permutation: free
Number of permutations: 1000

Terms added sequentially (first to last)

          Df SumsOfSqs MeanSqs F.Model   R2 Pr(>F)    
ENV        3      2.82   0.939    7.68 0.47  0.001 ***
Residuals 26      3.18   0.122         0.53           
Total     29      5.99                 1.00           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

These results indicate that centroids are significantly different across environments meaning that communities are different by environments.

We can also use the pairwiseAdonis package for pair-wise PERMANOVA analysis.

              pairs F.Model     R2 p.value p.adjusted sig
1      Coral vs Mat   4.391 0.2388   0.001      0.006   *
2 Coral vs Sediment   4.845 0.2324   0.001      0.006   *
3    Coral vs Water   8.384 0.3178   0.001      0.006   *
4   Mat vs Sediment   9.246 0.5361   0.008      0.048   .
5      Mat vs Water  19.886 0.6654   0.002      0.012   .
6 Sediment vs Water  15.525 0.5640   0.001      0.006   *

Here we see again we see that communities are different by environments.

However, PERMANOVA assumes equal beta dispersion so we will use the betadisper function from the vegan package to calculate beta dispersion values.


    Homogeneity of multivariate dispersions

Call: betadisper(d = fish.jsd, group = sampledf$ENV, bias.adjust =
TRUE)

No. of Positive Eigenvalues: 28
No. of Negative Eigenvalues: 1

Average distance to median:
   Coral      Mat Sediment    Water 
   0.444    0.197    0.324    0.193 

Eigenvalues for PCoA axes:
(Showing 8 of 29 eigenvalues)
PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8 
1.313 0.831 0.804 0.445 0.411 0.258 0.240 0.221 

And then a pair-wise Permutation test for homogeneity of multivariate dispersions using permutest (again from the vegan package).


Permutation test for homogeneity of multivariate dispersions
Permutation: free
Number of permutations: 1000

Response: Distances
          Df Sum Sq Mean Sq    F N.Perm Pr(>F)    
Groups     3  0.375  0.1251 12.6   1000  0.001 ***
Residuals 26  0.258  0.0099                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pairwise comparisons:
(Observed p-value below diagonal, permuted p-value above diagonal)
            Coral      Mat Sediment Water
Coral             9.99e-04 9.99e-04  0.00
Mat      6.50e-10          2.00e-03  0.98
Sediment 1.78e-06 1.59e-04           0.12
Water    1.94e-04 9.67e-01 1.15e-01      

These results are significant, meaning that environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the significant differences (p-value < 0.05) are between most environments.

This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions.

We can also use Analysis of Similarity (ANOSIM)—which does not assume equal group variances—to test whether overall microbial communities differ by environment.


Call:
anosim(x = distance(ps_slv_work_filt, "jsd"), grouping = spgroup) 
Dissimilarity: 

ANOSIM statistic R: 0.698 
      Significance: 0.001 

Permutation: free
Number of permutations: 999

Upper quantiles of permutations (null model):
   90%    95%  97.5%    99% 
0.0916 0.1263 0.1446 0.1744 

Dissimilarity ranks between and within classes:
         0%    25%   50%    75% 100%   N
Between  52 173.75 260.5 343.25  411 320
Coral    28  71.25 107.5 228.50  411  66
Mat      13  14.25  15.5  16.75   18   6
Sediment 22  35.50  40.0  50.50   57  15
Water     1   7.75  20.5  31.25   48  28

And the AN0SIM result is significant meaning that environment influences microbial community composition.


To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below.

---
title: Hypoxia
output: 
  flexdashboard::flex_dashboard:
    source_code: embed
    orientation: columns
    vertical_layout: fill
    theme: cerulean
    number_sections: true
---
```{r setup, include=FALSE}
remove(list = ls())
library(flexdashboard)
library(ShortRead); packageVersion("ShortRead")
library(phyloseq); packageVersion("phyloseq")
library(ggplot2); packageVersion("ggplot2")
library("plyr"); packageVersion("plyr")
library(vegan)
library(scales)
library(grid)
library(reshape2)
library(rstudioapi)
library(knitr)
library(kableExtra)
library(data.table)
library(DT)
library(rmarkdown)
library(pander)
library(formatR)
library(tidyverse)
library(gridExtra)
library(grid)
library(grDevices)
library(svgPanZoom) 
library(RCurl)
library(plotly)
library(pairwiseAdonis)
library(scales)
library(dplyr)
library(lubridate)
sessionInfo()
set.seed(0199)
```

```{r set_wd, include=FALSE}
knitr::opts_knit$set(root.dir = getwd())
# This will setwd to wherever the .Rmd file is opened.
ptm <- proc.time()
start_time <- Sys.time()
# opts_chunk$set(cache=TRUE)
#formatR::tidy_app() run this in R to tidy code. How to do it here?
```

Introduction 
=========================================

Column
-----------------------------------------

### Preamble

Here we present 16S rRNA gene amplicon data for the hypoxic event that took place in the Bay of Almirante, Bocas del Toro, Panama, during the fall of 2017. This document is  **interactive**. You can **sort and scroll** through most of the tables and some figures are  **zoomable**. In the upper right hand corner of the front page is a `Source Code` button. Use this to copy the source code of this document. Our goal in the future is to incorporate more interactive figures and more friendly ways of exploring the data. Let's proceed.

> Important Definitions & Abbreviations  

* [Amplicon Sequence Variant](https://www.nature.com/articles/ismej2017119){target="_blank"} (**ASV**): Exact sequence variant---analogous to an OTU---but with single nucleotide resolution. For all you ecologists out there an ASV, for our purposes, is akin to a species.  
* Differentially abundant (**DA**) feature: Taxa, ASV, etc. that is disproportionately abundant in a group of samples and statistically different than other groups. 

***

The goals of this study were to:

1. 
2. 
3. 
4.


Column
-----------------------------------------

###Workflow overview

#####[Part I: Data Preparation](#Part I: Data Preparation)

In this first part we go through the steps of defining sample groups, creating phyloseq objects, removing unwanted samples, and removing contaminants. Various parts of this section can easily be modified to perform different analyses. For example, if you were only interested in a specific taxa or group of samples, you could change the code here to create new phyloseq objects.

#####[Part II: Taxonomic Diversity](#Part II: Taxonomic Diversity)  

In the second part, we assess taxonomic composition of these samples. We primarily look at Class-level taxonomic composition to get a birds-eye-view of the system.

#####[Part III: Alpha Diversity](#Part III: Alpha Diversity) 

In the third part, we look alpha diversity. Here we use several indices and associated statistics to determine a) the diversity of different environments and b) whether that diversity is statistically different. 

#####[Part IV: Beta Diversity](#Part IV: Beta Diversity)  

Next we look at beta diversity. DETAILS

#####[Part V:Differentially Abundant ASVs](#Part V: Differentially Abundant ASVs)  PENDING

We wanted to understand how ASVs partitioned across environment. We also wanted to assess the specificity of each ASV to determine habitat preference. To our knowledge there is no quantitative way to do this. The only attempt we are aware of was [MetaMetaDB](http://mmdb.aori.u-tokyo.ac.jp/){target="_blank"} but it is based on a 454 database and no longer seems to be in active development. So we used an approach based on the work of [Sullam *et. al.*](https://doi.org/10.1111/j.1365-294X.2012.05552.x){target="_blank"}, first identifying differentially abundant ASVs, then searching for closest database hits, and finally using phylogenetic analysis and top hit metadata (isolation source, natural host, environment) to infer habitat preference.

#####[Part VI: Synthesis ](#Part VI: Synthesis) PENDING

In this section we pull together all the results  and try to make sense of the microbiomes from these environments. How are ASVs partitioning across environment? How similar are these ASVs to sequences from other studies? What can these patterns tell us about habitat specificity?


I. Physiochemistry  {.storyboard}
=========================================

### **Depth Profiles**:  Dissolved Oxygen Depth Profiles  {data-commentary-width=600}

```{r, fig.width = 6, fig.height=4}
ds <- read.csv(file="DATA_TABLES/depth_profiles.csv",
                  stringsAsFactors = FALSE)

ds$site_o= factor(ds$site, levels=c('Crawl_Key','Cayo_Roldon','Hospital_Point','Finca',
                                    'Mainland', 'Pastores', 'Punta_Caracol', 'Almirante'))
p1   <- ggplot(ds, aes(DO,depth_m)) +
  geom_point(shape = 16, size = 1.5, color = "darkblue") +
  facet_wrap(~ site_o, ncol = 2) +
  labs(#title = "Dissolved Oxygen Depth Profiles",
      # subtitle = "Data plotted by site, Sept. 21, 2017",
       y = "depth (m)",
       x = "DO mg/L") +
  geom_vline(xintercept = 2, linetype="dashed",color = "red", size=.5) +
  scale_y_continuous(trans = "reverse",  limits = c(11,0))
plotly::ggplotly(p1)
#geom_smooth(method = loess, color = "darkblue")+
#theme_bw(base_size = 10) +
# set scale to 11 m, add smoother line, and add hypoxia marker
```

***

```{r}
str(ds)
```

II. 16S rRNA Data Prep {.storyboard}
=========================================

### 1. **Define sample groups**: To prepare for downstream analysis, we first add  group designations.  {data-commentary-width=500}

1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (`combo_pipeline.rdata`) contains the ASV-by-sample table and the ASV taxonomy table. 

2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

```{r deliniate_sample_types}
load("../01_DADA2_WORKFLOW/combo_pipeline.rdata")
samples.out <- rownames(seqtab)
subject <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
# this splits the string at first instance of a digit
sample_name <- substr(samples.out, 1, 999)  # use the whole string for individuals
environ <- substr(samples.out, 0, 1)  # use the first two letters for genus
site <- substr(samples.out, 2, 3)  # use the next three letters for species
num_samp <- length(unique(sample_name))
num_environ <- length(unique(environ))
num_sites <- length(unique(site))
```

3. And finally we define a sample data frame that holds the different groups we extracted from the sample names. On the right are the  sample names and the different group categories.  

> Sample abbreviations are as follows: 

**X** indicates the Environment

* W = Water
* S = Sediment
* C = Coral
* M = Mat

**YY** indicates the site name

* CC = Coral Caye
* CR = Cayo Roldan

**Z** indicates the replicate number

So  `MCR5` is a Mat sample from Cayo Roldan, replicate 5. 

*** 

```{r define_variables}
#define a sample data frame
samdf <- data.frame(SamName = sample_name, ENV = environ, SITE = site) 
samdf$SITE <- gsub('CC', 'Coral Caye', samdf$SITE)
samdf$SITE <- gsub('CR', 'Cayo Roldan', samdf$SITE)
samdf$ENV <- gsub('W', 'Water', samdf$ENV)
samdf$ENV <- gsub('S', 'Sediment', samdf$ENV)
samdf$ENV <- gsub('C', 'Coral', samdf$ENV)
samdf$ENV <- gsub('M', 'Mat', samdf$ENV)

rownames(samdf) <- samples.out
kable(samdf, row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)  %>%
  column_spec(1:3, width = "3.5cm")
```

### 2. **Create & modify a phyloseq object**: a phyloseq (`ps`) object serves as the basis for all downstream analyses.{data-commentary-width=500}

1. Create a `phyloseq` (ps) object using the silva (slv) taxonomy generated in the DADA2 workflow. 

2. Rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can  get cumbersome downstream, so we rename the ASVs using a simpler convention---ASV1, ASV2, ASV3, and so on. 

```{r create_ps_object}
ps_slv <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE), sample_data(samdf), 
    tax_table(tax_silva)) # this create the phyloseq object
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
taxa_names(ps_slv) <- paste0("ASV", seq(ntaxa(ps_slv))) # adding unique ASV names
tax_table(ps_slv) <- cbind(tax_table(ps_slv), rownames(tax_table(ps_slv)))
head(taxa_names(ps_slv))
```

3. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file.

```{r add_ASV_coulmn}
colnames(tax_table(ps_slv)) <- c("Kingdom", "Phylum", "Class", "Order", 
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
```

4. Export sequence and taxonomy tables for the unadulterated phyloseq object to the `DATA_TABLES` directory for later use. We will use the prefix `full` to indicate that these are raw data products. 

```{r export_seq_tax_tables}
write.table(tax_table(ps_slv), "DATA_TABLES/OTHER_TABLES/full_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv)), "DATA_TABLES/OTHER_TABLES/full_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```

**Optional**

5. You can also remove  any samples at this point. To do this you must select the samples you *wish to keep*. If you want to change the group of samples, modify the script accordingly. For now this function  is *off* meaning you must modify the `eval` flag in the code chunk and list sample names for this to run.

```{r select_samples, eval = FALSE, include = FALSE}
ps_slv_base <- prune_samples(c("SAMPLE1", "SAMPLE2", ...), ps_slv)
ps_slv_base
```

6. If you remove samples you probably lost some ASVs. So you need to get rid of any ASVs that have a total of **0 reads**. 

```{r remove_ASV_with_zeros_reads, eval = FALSE, include = FALSE}
ps_slv_work <- prune_taxa(taxa_sums(ps_slv_base) > 0, ps_slv_base)
ps_slv_work
```

***
This is our raw phyloseq object.

```{r, results = 'markdown' }
ps_slv
```

The phyloseq object contains: 

* `r sum(otu_table(ps_slv))` total reads  
* across `r ntaxa(ps_slv)` ASVs 
* and `r nsamples(ps_slv)` samples 

### 3. **Remove contaminants**: In this case mitochondria, chloroplasts, & Eukaryotic ASVs {data-commentary-width=600}

1. Remove any Mitochondria hits. Remember the original dataset contained `r ntaxa(ps_slv)` ASVs.

To accomplish this we do the following:

* Subset the taxa to generate a `ps` object of just Mitochondria, 
* Select the ASV column only, turn it into a factor, and use this to remove Mitochondria from the `ps` object. 

```{r remove_specific_taxa}
# generate a file with mitochondria ASVs
MT1 <- subset_taxa(ps_slv, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps_slv), MT1df)
ps_slv_work_no_mito <- prune_taxa(goodTaxa, ps_slv)
ps_slv_work_no_mito
```

Looks like this removed **`r ntaxa(ps_slv) - ntaxa(ps_slv_work_no_mito)` Mitochondria ASVs**. We will duplicate the code block to remove other groups. We will also get rid of any Eukaryota and Chloroplast ASVs. 

2. Remove any Chloroplast hits.

```{r remove_specific_taxa2}
# generate a file with mitochondria ASVs
CH1 <- subset_taxa(ps_slv_work_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_mito), CH1df)
ps_slv_work_no_chloro <- prune_taxa(goodTaxa, ps_slv_work_no_mito)
ps_slv_work_no_chloro
```

The code removed an additional **`r ntaxa(ps_slv_work_no_mito) - ntaxa(ps_slv_work_no_chloro)` ASVs** classified as Chloroplast. 

3. Remove any Eukaryotic hits.

```{r remove_specific_taxa3}
# generate a file with mitochondria ASVs
EU1 <- subset_taxa(ps_slv_work_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_slv_work_no_chloro), EU1df)
ps_slv_work_filt <- prune_taxa(goodTaxa, ps_slv_work_no_chloro)
ps_slv_work_filt
```

Finally, **`r ntaxa(ps_slv_work_no_chloro) - ntaxa(ps_slv_work_filt)` Eukaryota ASVs** were eliminated from the `ps` object. 

***

**Note** the command used for this operation  (`subset_taxa` within phyloseq)  removes anything that is `NA` for the taxonomic level of interest or above. That's fine if you are working at the Phylum level---for example is you are only getting rid of unknown Bacteria and unknown unknowns. However if you want to  run this command using `Family != "Mitochondria"` you will not only get rid all ASVs classified as Mitochondria but everything  else that has `NA` for Family and above. This seems pretty insane to me. Our dataset only had 27 Mitochondria ASVs but running the  command *as is* will remove any  ASVs not classified at Family level or above. And it is not well documented that the command is doing this---I had to go digging through the files to figure out what was happening. So lets see if we can get rid of **just** Mitochondria ASVs.

***

Our final working phyloseq object (`ps_slv_work_filt`) contains

* `r sum(otu_table(ps_slv_work_filt))` total reads  
* and `r ntaxa(ps_slv_work_filt)` ASVs. 

```{r}
mean_reads <- as.integer(mean(sample_sums(ps_slv_work_filt)))
min_reads <- as.integer(min(sample_sums(ps_slv_work_filt)))
max_reads <- as.integer(max(sample_sums(ps_slv_work_filt)))
```

The mean number of reads per sample is **`r mean_reads`**. The minimum number of reads per sample is **`r  min_reads`** while the  maximum is **`r max_reads`** reads.

```{r}
ps_slv_work_filt
```

We will again export the sequence and taxonomy tables to the `DATA_TABLES` directory this time with the file prefix `trim` to indicate that contaminants have been removed from these data.

```{r export_seq_tax_tables_2}
write.table(tax_table(ps_slv_work_filt), "DATA_TABLES/OTHER_TABLES/trim_tax_table.txt", sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_slv_work_filt)), "DATA_TABLES/OTHER_TABLES/trim_seq_table.txt", sep="\t", quote = FALSE, col.names=NA)
```

### 4. **Sample summary**: add total reads and total ASVs to sample summary table{data-commentary-width=400}

```{r sample_summary_table}
total_reads <- sample_sums(ps_slv_work_filt)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")

total_asvs <- estimate_richness(ps_slv_work_filt, measures = "Observed")
total_asvs <- total_asvs %>% rownames_to_column("Sample_ID")

sam_details <- samdf
rownames(sam_details) <- NULL

colnames(sam_details) <- c("Sample_ID", "ENV", "Site") 

merge_tab <- merge(sam_details, total_reads, by = "Sample_ID")
merge_tab2 <- merge(merge_tab, total_asvs, by = "Sample_ID")
colnames(merge_tab2) <- c("Sample_ID", "Environment", "Site", 
    "total_reads", "total_ASVs")

datatable(merge_tab2, rownames = FALSE, width = "100%", 
    caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", 
    "Table: ", htmltools::em("Sample summary.")), extensions = "Buttons", 
    options = list(columnDefs = list(list(className = "dt-left", targets = 0)), 
        dom = "Blfrtip", buttons = c("csv", 
            "copy"), scrollX = TRUE, scrollCollapse = FALSE))
```

***

At this point we can amend the sample summary table with total reads and ASVs for each sample. If you sort the table by total_reads you can see that 5 samples (all Coral) have less than 1000 reads. We should probably eliminate those samples from the analysis. `TODO`

***

One last thing to do is to create a merged phyloseq object grouped by environment. This will come in handy later for some analyses. Basically we collapsed all samples from the same environment type together.

```{r merge, warning=FALSE}
mergedGP <- merge_samples(ps_slv_work_filt, "ENV")
SD <- merge_samples(sample_data(ps_slv_work_filt), "ENV")
mergedGP
sample_names(mergedGP)
```

Great, still `r ntaxa(mergedGP)` ASVs but now only `r nsamples(mergedGP)` "samples" corresponding to the unique environment types. 

### 5. **Conclusion of Data Preparation Section** {data-commentary-width=400}

> `ps_slv`

```{r}
ps_slv
```

This phyloseq object contains `r ntaxa(ps_slv)` ASVs, `r sum(otu_table(ps_slv))` total reads, and `r nsamples(ps_slv)` samples. 

> `ps_slv_work_filt`

```{r}
ps_slv_work_filt
```

This phyloseq object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after removing Mitochondria, Chloroplast, and Eukaryotes. 

> `mergedGP`

```{r}
mergedGP
```

This phyloseq object contains `r ntaxa(ps_slv_work_filt)` ASVs, `r sum(otu_table(ps_slv_work_filt))` total reads, and `r nsamples(ps_slv_work_filt)` samples after collapsing by Environment (ENV).

***

At this point in the workflow there are  the several different phyloseq objects to chose from and, using the above methods, additional objects can easily be created. 

III. Taxonomic Diversity {.storyboard}
=========================================

### **Overview**   {data-commentary-width=600}

Much of the subsequent taxonomic analyses  will be at the **Class** & **Family** levels. We chose not to focus on the Genus level because there simply is not enough resolution in our dataset to build a cohesive story. This is because most marine systems  are (microbially) understudied and we are dealing with short read data. On the other hand, Phylum level is too coarse for groups like Proteobacteria and Firmicutes. Order did not provide any additional information and can be cumbersome for taxa with poorly resolved lineages. Depending on the dataset, you may want to change your strategy. 

Color blindness & graphics

Before we continue we need to talk about color. Microbial diversity is huge and it is difficult to display all of this diversity in a single, static figure. As microbial ecologists we often use colors to convey our message. Yes many of us have a decreased ability to see color or differences in color. So for our figures we are going to create a custom, color friendly palette based on Bang Wong's scheme described in this [paper](http://dx.doi.org/10.1038/nmeth.1618){target="_blank"}. Wong's color scheme uses contrasting colors that can be distinguished by folks with color vision deficiency---roughly 8% of people (mostly males) are color blind. Do want Keanu Reeves to understand your figures or not? 

This scheme is conservative---there are only 7 colors. We added black and grey to give us a little wiggle room. Others have developed [12 and 15 color palettes](http://mkweb.bcgsc.ca/colorblind/){target="_blank"}, but be careful---figures with too many colors can inhibit our ability to discern patterns. Our conservative palette forces us to choose carefully when deciding which taxa to target or how many groups to display. 

***

We created two different color palettes---one for taxa and one for environment---using the same 9 colors. 

```{r define_color_blind_scheme}
friend_pal <- c("#009E73", "#D55E00", "#F0E442", "#CC79A7", "#56B4E9", 
    "#E69F00", "#0072B2", "#7F7F7F", "#000000")

samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9")

plot(x=1:9, y=rep(5,9), pch=19, cex=7, col=friend_pal, 
     col.axis = 'white', col.lab = 'white')

```

### **Total reads & ASVs by Taxa**: What are the dominant taxa in this system? Here we take a Class-level look at overall diversity.   {data-commentary-width=300}

```{r diversity_table}
# generate the ASV table
tax_asv <- table(tax_table(ps_slv_work_filt)[, "Class"], exclude = NULL, 
    dnn = "Taxa")
tax_asv <- as.data.frame(tax_asv, make.names = TRUE)
# generate the reads table
tax_reads <- factor(tax_table(ps_slv_work_filt)[, "Class"])
tax_reads <- apply(otu_table(ps_slv_work_filt), MARGIN = 1, function(x)
{
    tapply(x, INDEX = tax_reads, FUN = sum, na.rm = TRUE, simplify = TRUE)
})
tax_reads <- as.data.frame(tax_reads, make.names = TRUE)
tax_reads <- cbind(tax_reads, reads = rowSums(tax_reads))
tax_reads <- tax_reads[31]
tax_reads <- setDT(tax_reads, keep.rownames = TRUE)[]
# merge the two tables and make everything look pretty
# in an interactive table

taxa_read_asv_tab <- merge(tax_reads, tax_asv, by.x = "rn", by.y = "Taxa")
top_reads <- top_n(taxa_read_asv_tab, n = 6, wt = reads)
top_asvs <- top_n(taxa_read_asv_tab, n = 6, wt = Freq)

names(taxa_read_asv_tab) <- c("Taxa", "total reads", "total ASVs")

write.table(taxa_read_asv_tab, "DATA_TABLES/SUPPLEMENTARY_TABLES/Table_S2.txt", sep = "\t", row.names = FALSE, 
    quote = FALSE)

datatable(taxa_read_asv_tab, rownames = FALSE, width = "100%", colnames = c("Taxa", "total reads", "total ASVs"), caption = htmltools::tags$caption(style = "caption-side: top; text-align: left;", 
    "Table 2: ", htmltools::em("Total reads & ASVs by Class")), extensions = "Buttons", 
    options = list(columnDefs = list(list(className = "dt-left", targets = 0)), 
        dom = "Blfrtip", pageLength = 10, lengthMenu = c(5, 10, 50, 100), buttons = c("csv", 
            "copy")))

top_reads2 <- top_reads[,-1]
rownames(top_reads2) <- top_reads[,1]

top_asvs2 <- top_asvs[,-1]
rownames(top_asvs2) <- top_asvs[,1]

```

***

>**Most reads by taxa**: `r row.names(top_reads2)`

>**Most ASVs by taxa**: `r row.names(top_asvs2)`

### **Relative read abundance**: In this section we calculate relative percent read abundance for the combined dataset. This allows us to collapse rare taxa into an **Other** category. {data-commentary-width=500}

```{r calc_rel_abund_and_merge}
# calculate the averages and merge by species
ps_slv_filt_AVG <- transform_sample_counts(ps_slv_work_filt, function(x) x/sum(x))
mergedGP_BAR <- merge_samples(ps_slv_filt_AVG, "ENV")
SD_BAR <- merge_samples(sample_data(ps_slv_filt_AVG), "ENV")

# merge taxa by rank. If you choose a different rank be sure to change
# the rank throughout this code chunk
mdata_phy <- tax_glom(mergedGP_BAR, taxrank = "Class", NArm = FALSE)  
mdata_phyrel <- transform_sample_counts(mdata_phy, function(x) x/sum(x))
meltd <- psmelt(mdata_phyrel)
meltd$Class <- as.character(meltd$Class)

# calculate the total relative abundance for all taxa
means <- ddply(meltd, ~Class, function(x) c(mean = mean(x$Abundance)))
means$mean <- round(means$mean, digits = 5)
taxa_means <- means[order(-means$mean), ]  # this order in decending fashion
taxa_means <- format(taxa_means, scientific = FALSE)  # ditch the sci notation 

top_perc <- top_n(taxa_means, n = 8, wt = mean)
top_perc$mean <- round(as.numeric(top_perc$mean), digits = 5)
min_top_perc <- round(as.numeric(min(top_perc$mean)), digits = 5)
```

```{r rel_abund_table}
datatable(taxa_means, rownames = FALSE, width = "65%", colnames = c("Class", 
    "mean"), 
    caption = htmltools::tags$caption(style = 
        "caption-side: bottom; text-align: left;", "Table NA: ", 
        htmltools::em("Class-level relative abundance.")), 
    extensions = "Buttons", options = list(columnDefs = 
        list(list(className = "dt-center", targets = "_all")), 
         dom = "Blfrtip", pageLength = 10, lengthMenu = c(10, 50, 100), 
         buttons = c("csv", "copy")))

write.table(taxa_means, "DATA_TABLES/OTHER_TABLES/Table_NA_Taxa_Means.txt", sep = "\t", row.names = FALSE, 
    quote = FALSE)

```

***
Lets take a closer look Class-level taxonomic content of these communities. There are numerous ways to do this but here we chose to collapse samples by environment and display the relative abundance of the most dominant taxa. We also generated alternative views of taxonomic composition for individual samples---a [box-and-whisker plot](#box-and-whisker-plot) as well as [separate bar plots](#separate-bar-plots). 

I know stacked bar charts are pretty lame but we like them for a birds eye view of the data. Here we calculate the relative abundance of taxa for each environment at the **Class** level. It turns out this is not too easy in phyloseq and there is a lot of (messy) code. 

Since our goal is to generate a figure and we only have 9 colors, some taxa will need to be put into an **Other** category. We can define 'Other' however we like so lets take a look at the overall relative abundance of each Class. 

Inspecting the table it looks like if we choose a **cutoff of `r min_top_perc`** (`r min_top_perc*100`%) we get 8 taxa---sounds pretty good. The rest go into the 'Other' category. No matter what, we will always gloss over some groups using such a coarse approach. But as we see later some of these lower abundance groups will reappear when we look at communities at the level of individual ASVs. Next we define the **Other** category and draw the graph. 

### **Class abundance across sample type**: Here we generate a bar chart to look at taxonomic abundance for each of the main environment types. {data-commentary-width=300}

```{r define_other}
# Here we conglomerate at 2%.
Other <- means[means$mean < min_top_perc, ]$Class  
# or you can chose specifc taxa like this
#Other_manual <- c("list", "taxa", "in", "this", "format")

```

```{r metld_bar}
meltd[meltd$Class %in% Other, ]$Class <- "Other"
samp_names <- aggregate(meltd$Abundance, by = list(meltd$Sample), FUN = sum)[, 1]
.e <- environment()
meltd[, "Class"] <- factor(meltd[, "Class"], sort(unique(meltd[, "Class"])))
meltd <- meltd[order(meltd[, "Class"]), ]
# Here we order Classes by the Phylum they belong to.
#(meltd$Class)
meltd$Class <- factor(meltd$Class, levels = c("Bacteroidia", "Alphaproteobacteria", 
                                              "Gammaproteobacteria", "Oxyphotobacteria", 
                                              "Deltaproteobacteria", "Clostridia", 
                                              "Spirochaetia", "Mollicutes", "Other"))  
```

```{r plot_bar_fig2A, fig.align = "center", fig.cap = "Abundance of bacterial taxa across environment", fig.height = 3}
fig2A <- ggplot(meltd, aes_string(x = "Sample", y = "Abundance", fill = "Class"), 
    environment = .e, ordered = TRUE, xlab = "x-axis label", ylab = "y-axis label") 
fig2A <- fig2A + geom_bar(stat = "identity", position = position_stack(reverse = TRUE), 
    width = 0.95) + coord_flip() + theme(aspect.ratio = 1/2)
fig2A <- fig2A + scale_fill_manual(values = friend_pal)
fig2A <- fig2A + theme(axis.text.x = element_text(angle = 0, hjust = 0.45, 
    vjust = 1))
fig2A <- fig2A + guides(fill = guide_legend(override.aes = list(colour = NULL), 
    reverse = FALSE)) + theme(legend.key = element_rect(colour = "black"))
fig2A <- fig2A + labs(x = "Environment", y = "Relative abundance (% total reads)")
fig2A <- fig2A + theme(axis.line = element_line(colour = "black"), panel.grid.major = element_blank(), 
    panel.grid.minor = element_blank(), panel.border = element_rect(colour = "black", 
        fill = NA, size = 1))
fig2A
pdf("FIGURES/MARKDOWN_OUTPUT/Figure_2A.pdf")
fig2A
invisible(dev.off())
```

***

At a 1.5% abundance cutoff, `r length(Other)` Classes are grouped into the 'Other' category. Great, now we can craft the bar chart. And here are the taxa that will go into the chart:

It took some tweaking to get the bar chart to look just right---so there is a lot of code here---and it could most certainly be better :/. While we're at it we will also save a copy of the figure so we can tweak it later and make it look pretty.

One thing to notice is that the sediment samples have a large percentage of ASVs grouped into the **Other** category. We will  see what these taxa are in the next panel.

### **Class abundance across sample type**: Here we generate a modified bar chart separated by individual samples. {data-commentary-width=400}

```{r supp_fig1_calc, cache = TRUE}
mdata_phy_all <- tax_glom(ps_slv_filt_AVG, taxrank = "Class", NArm = FALSE)
# You can choose any taxonomic level here
mdata_phyrel_all <- transform_sample_counts(mdata_phy_all, function(x) x/sum(x))
meltd_all <- psmelt(mdata_phyrel_all)
meltd_all$Class <- as.character(meltd_all$Class)

means_all <- ddply(meltd_all, ~Class, function(x) c(mean = mean(x$Abundance)))
means_all$mean <- round(as.numeric(means_all$mean), digits = 5)
taxa_means_all <- means_all[order(-means_all$mean), ]  # decending order
taxa_means_all <- format(taxa_means_all, scientific = FALSE)  # ditch the sci notation 

top_perc_all <- top_n(taxa_means_all, n = 8, wt = mean)
top_perc_all$mean <- round(as.numeric(top_perc_all$mean), digits = 5)
min_top_perc_all <- round(as.numeric(min(top_perc_all$mean)), digits = 5)

Other <- means_all[means_all$mean < min_top_perc_all, ]$Class  # Here we conglomerate at 2%.

meltd_all[meltd_all$Class %in% Other, ]$Class <- "Other"
samp_names <- aggregate(meltd_all$Abundance, by = list(meltd_all$Sample), 
    FUN = sum)[, 1]
.e <- environment()
meltd_all[, "Class"] <- factor(meltd_all[, "Class"], sort(unique(meltd_all[, 
    "Class"])))
meltd_all <- meltd_all[order(meltd_all[, "Class"]), ]
#levels(meltd_all$Class)
meltd_all$Class <- factor(meltd_all$Class, levels = c(
                     "Bacteroidia", "Alphaproteobacteria", 
                     "Gammaproteobacteria", "Oxyphotobacteria", 
                     "Clostridia", "Deltaproteobacteria", 
                      "Spirochaetia", "Mollicutes",
                      "Other"))  # Here we order Classes by the Phylum they belong to.
```

```{r supp_fig1, include=FALSE, echo=FALSE, eval=FALSE}
sup_fig1 <- qplot(data = meltd_all, x = ENV, y = Abundance, fill = Class, 
    geom = "boxplot", ylab = "Relative Abundance") + theme(legend.position = "none") + 
    facet_grid(Class ~ ., scales = "free_y", space = "free_y") + geom_jitter(width = 0.05) + 
    geom_point(colour = "black", fill = "white")  #+ guides(guide_legend(reverse = FALSE) )
sup_fig1 <- sup_fig1 + scale_fill_manual(values = friend_pal) + labs(x = "Environment", 
    y = "Relative abundance (% total reads)")
sup_fig1
pdf("FIGURES/Figure_S1A.pdf")
sup_fig1
invisible(dev.off())
```

```{r separate_bars_stacked}
options(scipen = 3)
options(digits = 4)
sup_fig2 <- ggplot(meltd_all, aes_string(x = "Sample", y = "Abundance", 
    fill = "Class"), environment = .e, Ordered = TRUE)
sup_fig2 <- sup_fig2 + geom_bar(stat = "identity", position = "stack") + 
    facet_grid(Class ~ ENV, scales = "free", space = "free_y")
sup_fig2 <- sup_fig2 + scale_fill_manual(values = friend_pal)

# sup_fig2 <- sup_fig2 + theme(axis.text.x = element_text(angle = -90,
# hjust = 0))
sup_fig2 <- sup_fig2 + theme(axis.text.x = element_blank())
sup_fig2 <- sup_fig2 + guides(fill = guide_legend(override.aes = list(colour = NULL), 
    reverse = FALSE)) + theme(legend.key = element_rect(colour = "black"), 
                              legend.position = "none") + 
    labs(x = "Individual samples", y = "Relative abundance (% total reads)")

plotly::ggplotly(sup_fig2, tooltip = c('Sample', 'Abundance'), 
                 originalData = TRUE, width = 600, height = 800) 

#sup_fig2
#pdf("FIGURES/SUPPLEMENTARY_FIGURES/Figure_S1B.pdf")
#sup_fig2
#invisible(dev.off())

```

***

Here we see the same data as in the last figure except now presented as individual samples. We see that the sediment samples all have a larger percentage of *Other* taxa. If we examine Phylum level abundance in sediment samples we see several groups that are not well represented in the rest of the dataset including 


```{r sed_calc}
ps_slv_filt_AVG_sed <- prune_samples(c("SCC1", "SCC2", "SCC3", "SCR1", "SCR2", "SCR3"), ps_slv_filt_AVG)
ps_slv_filt_AVG_sed <- prune_taxa(taxa_sums(ps_slv_filt_AVG_sed) > 0, ps_slv_filt_AVG_sed)
mdata_phy_sed <- tax_glom(ps_slv_filt_AVG_sed, taxrank = "Phylum", NArm = FALSE)
mdata_phyrel_sed <- transform_sample_counts(mdata_phy_sed, function(x) x/sum(x))
meltd_sed <- psmelt(mdata_phyrel_sed)
meltd_sed$Class <- as.character(meltd_sed$Phylum)
means_sed <- ddply(meltd_sed, ~Phylum, function(x) c(mean = mean(x$Abundance)))
means_sed$mean <- round(as.numeric(means_sed$mean), digits = 5)

mdata_phy_comp <- tax_glom(ps_slv_filt_AVG, taxrank = "Phylum", NArm = FALSE)
mdata_phyrel_comp <- transform_sample_counts(mdata_phy_comp, function(x) x/sum(x))
meltd_comp <- psmelt(mdata_phyrel_comp)
meltd_comp$Class <- as.character(meltd_comp$Phylum)
means_comp <- ddply(meltd_comp, ~Phylum, function(x) c(mean = mean(x$Abundance)))
means_comp$mean <- round(as.numeric(means_comp$mean), digits = 5)

merge_sed_comp <- merge(means_sed, means_comp,  by = "Phylum", all = TRUE)
merge_sed_comp <- merge_sed_comp[order(-merge_sed_comp$mean.x), ]
colnames(merge_sed_comp) <- c("Phylum",  "mean Sed samples", "mean All samples")

kable(merge_sed_comp, row.names = FALSE) %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)  %>%
  column_spec(1:3, width = "2.5cm")
```



IV. $\alpha$-Diversity {.storyboard}
=========================================

### **Overview**: Armed with a picture of taxonomic composition we can move on to diversity estimates.  {data-commentary-width=600}


### **$\alpha$-diversity estimates** Here we use several diversity indices and add the results to the summary table. 

```{r gen_summary_table, warning = FALSE}
diversity <- estimate_richness(ps_slv_work_filt, measures=c(
            "Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher"))
diversity_calc <- diversity %>% rownames_to_column("Sample_ID")
# round values
diversity_calc[c(3,5,10)] <- round(diversity_calc[c(3,5,10)], 1)
diversity_calc[c(4,6,7,9)] <- round(diversity_calc[c(4,6,7,9)], 2)
diversity_calc[8] <- round(diversity_calc[8], 3)

#host_summary <- merge(host_details, diversity_calc)
#host_summary$Observed <- NULL
merge_tab2$total_ASVs <- NULL
sample_table <- merge(merge_tab2, diversity_calc, by = "Sample_ID")

write.table(sample_table, "DATA_TABLES/SUPPLEMENTARY_TABLES/Table_S1.txt", 
            sep = "\t", row.names = FALSE, quote = FALSE)

datatable(sample_table, rownames = FALSE, width = "65%", 
          colnames = c("Sample_ID", "Environment", "Site", "total reads",
                       "No. ASVs", "Chao1", "Chao1 (SE)", 
                       "ACE", "ACE (SE)", "Shannon", 
                       "Simpson", "InvSimpson", "Fisher"), 
    caption = htmltools::tags$caption(style = 
        "caption-side: bottom; text-align: left;", "Table NA: ", 
        htmltools::em("Class-level relative abundance.")), 
    extensions = "Buttons", options = list(columnDefs = 
        list(list(className = "dt-center", targets = "_all")), 
         dom = "Blfrtip", pageLength = 10, lengthMenu = c(20, 30), 
         buttons = c("csv", "copy")))
```

***

Alpha diversity describes the diversity in a sample or site.  There are several alpha diversity metrics available in phyloseq: `Observed`, `Chao1`, `ACE`, `Shannon`, `Simpson`, `InvSimpson`, `Fisher`. Play around to see how different metrics change or confirm these results. 

Here we want to know if diversity is significantly different across environments. In order to do that we need to know if we should run a parametric or non-parametric test, and for that we need to know if our data is normally distributed. Most of the ideas/code for alpha (and subsequent beta) diversity statistics come from this [workshop tutorial](https://rpubs.com/maddieSC/R_SOP_UCR_Jan_2018){target="_blank"} by Kim Dill-McFarland and Madison Cox.

First we run the diversity estimates, add these data to our summary table, and save a copy of this table.

### **$\alpha$-diversity statistics**: Test whether the results are normally distributed. Results of this will help guide the analyses we do next. {data-commentary-width=500}

```{r alpha_div_test_norm, warning = FALSE}
# Convert to ps object
sample_div <- sample_data(diversity) 
# Create new ps object with diversity estimates added to sample_data
ps_slv_work_filt_div <- merge_phyloseq(ps_slv_work_filt, sample_div)
# Run Shapiro test
shapiro_test_Shan <- shapiro.test(sample_data(ps_slv_work_filt_div)$Shannon)
shapiro_test_invSimp <- shapiro.test(sample_data(ps_slv_work_filt_div)$InvSimpson)
shapiro_test_Chao1 <- shapiro.test(sample_data(ps_slv_work_filt_div)$Chao1)
shapiro_test_Observed <- shapiro.test(sample_data(ps_slv_work_filt_div)$Observed)
```

Shapiro-Wilk Normality Test for **Shannon** index.
```{r shap_Shan, echo = FALSE}
shapiro_test_Shan
```

Shapiro-Wilk Normality Test for **inverse Simpson** index.
```{r shap_invS, echo = FALSE}
shapiro_test_invSimp
```

Shapiro-Wilk Normality Test for **Chao1 richness** estimator.
```{r shap_Choa1, echo = FALSE}
shapiro_test_Chao1
```

Shapiro-Wilk Normality Test for **Observed ASV richness** estimator.
```{r shap_Observed, echo = FALSE}
shapiro_test_Observed
```

***

Next, we add the diversity estimates to our phyloseq object, and test if the data are normally distributed using Shapiro-Wilk Normality test. We will focus on the inverse Simpson and Shannon diversity estimates, Chao’s richness estimate, and Observed richness  but this approach can be used for any metric.

Ok, since the p-values are significant for the inverse Simpson, Chao richness, and Observed ASV richness we reject the null hypothesis that these data are normally distributed. However, the Shannon estimates appear normally distributed. So lets see if diversity is significantly different between environment based on the Shannon index. 

### **Assessing significance of $\alpha$-diversity metrics**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=500}

Normally distributed metrics

```{r normal}
sampledataDF <- data.frame(sample_data(ps_slv_work_filt_div))
sampledataDF$ENV <- as.factor(sampledataDF$ENV)
sampledataDF$SITE <- as.factor(sampledataDF$SITE)
aov.shannon = aov(Shannon ~ ENV, data = sampledataDF)
#Call for the summary of that ANOVA, which will include P-values
summary(aov.shannon)
```

```{r tukey}
TukeyHSD(aov.shannon)
```

Non-normally distributed metrics

Kruskal-Wallis of **inverse Simpson** index.

```{r krusk_invsimp}
#library(FSA)
#dunnTest(InvSimpson ~ Sp, data = sampledataDF, method="bh") 
kruskal.test(InvSimpson ~ ENV, data = sampledataDF)
```

Pairwise significance test for **inverse Simpson** index.

```{r wilcox_invsimp}
pairwise.wilcox.test(sampledataDF$InvSimpson, sampledataDF$ENV, p.adjust.method="fdr")
```

Kruskal-Wallis of **Chao1 richness** estimator.

```{r krusk_chao}
#dunnTest(Chao1 ~ Sp, data = sampledataDF, method="bh") 
kruskal.test(Chao1 ~ ENV, data = sampledataDF)
```

Pairwise significance test for **Chao1 richness** estimator.
```{r wilcox_chao}
pairwise.wilcox.test(sampledataDF$Chao1, sampledataDF$ENV, p.adjust.method="fdr")
```

Kruskal-Wallis of **Observed ASV richness** index.

```{r krusk_observed}
#library(FSA)
#dunnTest(Observed ~ Sp, data = sampledataDF, method="bh") 
kruskal.test(Observed ~ ENV, data = sampledataDF)
```

Pairwise significance test for **Observed ASV richness** index.

```{r wilcox_observed, warning = FALSE}
pairwise.wilcox.test(sampledataDF$Observed, sampledataDF$ENV, p.adjust.method="fdr")
```

***

Since the Shannon data is normally distributed we can test for significance using ANOVA (a parametric test).

Ok, the results of the ANOVA are significant.  Here we use the Tukey's HSD (honestly significant difference) post-hoc test to determine which pairwise comparisons are different.

Looks like everything is significantly different except for Mat-Sediment communities from Cayo Roldan. Interesting.   

Now we can look at the results on the  inverse Simpson diversity and Chao’s richness. Since Environment is categorical, we use Kruskal-Wallis (non-parametric equivalent of ANOVA) to test for significance.

For the inverse Simpson, Chao1, and Observed richness the results of the Kruskal-Wallis rank sum test are significant. So we can look at pairwise comparisons using Wilcoxon rank sum test for post-hoc analysis.


### **$\alpha$-diversity plots**: Test whether diversity metrics are significant and use post-hoc analysis to see which sample groups are different.{data-commentary-width=500}


```{r alpha_div_fig_2B, fig.align = "center", fig.cap = "Figure 2B", warning = FALSE}
samp_pal <- c("#CC79A7", "#E69F00", "#009E73", "#56B4E9")
fig2B <- plot_richness(ps_slv_work_filt, x = "ENV", 
                       measures = c("Observed",
                                    "Shannon", 
                                    "InvSimpson", 
                                    "Chao1"), 
                       color = "ENV", nrow = 1)
fig2B <- fig2B + geom_boxplot() + geom_jitter(width = 0.05)
fig2B <- fig2B + scale_colour_manual(values = samp_pal) + 
         labs(x = "Host species", 
         y = "Diversity", 
         title = "Alpha diversity of microbial 
         communities ")
#fig2B + geom_boxplot(aes(colour = black))
#fig2B <- fig2B + theme_bw() + geom_point(size = 2.5, aes(color = ENV)) + 
fig2B
pdf("FIGURES/MARKDOWN_OUTPUT/Figure_2B.pdf")
fig2B
invisible(invisible(dev.off()))
```

***

Here we plot the results of each diversity index and include the *Observed* ASV richness. We will save a copy of the figure for later tweaking. We use the color palette described above to delineate environments.


V. $\beta$-Diversity {.storyboard}
=========================================

### **$\beta$-diversity analysis using **: NMDS coupled with Jensen–Shannon divergence.   {data-commentary-width=600}

```{r run_nmds, results = 'hide' } 
set.seed(3131)
ord.nmds.jsd_slv <- ordinate(ps_slv_work_filt, method = "NMDS", distance = "bray")
```

```{r ord_results}
ord.nmds.jsd_slv
```

***

Beta diversity basically tells us how similar or dissimilar samples are to one another. Phyloseq offers several ordination `methods` and `distance` metrics. Here we use non metric multidimensional scaling (NMDS) coupled with Jensen–Shannon divergence. We also save a copy of the figure for later tweaking. To see the full output of the NMDS analysis, remove the `results = 'hide'` tag from the code chunk.

We see that a convergent solution was reached around 20 iterations and our stress is below 0.20, meaning that 2-axes are sufficient to view the data. Generally, we are looking for stress values below 0.2. If the stress values are high, you may need to add more axes to the ordination. Lets visualize the plot.


### **$\beta$-diversity NMDS plot **: {data-commentary-width=400}

```{r beta_div_fig_2C, fig.align = "center"}
fig2C <- plot_ordination(ps_slv_work_filt, ord.nmds.jsd_slv, 
                         color = "ENV", 
                         shape = "SITE")
fig2C <- fig2C + geom_point(size = 4.6,  stroke = 0.075) + scale_colour_manual(values = samp_pal)
#fig2C <- fig2C + scale_colour_manual(values = samp_pal)

fig2C <- fig2C + theme(axis.line = element_line(colour = "black"), panel.background = element_blank(), 
    panel.grid.major = element_line("grey"), panel.grid.minor = element_line("grey"), 
    panel.border = element_rect(colour = "black", fill = NA, size = 1)) + 
    theme(legend.key = element_rect(colour = "black"))
fig2C <- fig2C + coord_fixed()
fig2C
#fig2C <- fig2C  + stat_ellipse(type = "t") + theme_bw()
#plotly::ggplotly(fig2C, tooltip = 'ENV',  originalData = TRUE)
#pdf("FIGURES/MARKDOWN_OUTPUT/Figure_2C.pdf")
#invisible(dev.off())
```

***

So we can see some clustering within groups and spread between groups, but this is not a test for statistical differences. Do microbial communities differ significantly by environment? 

### **$\beta$-diversity statistics **: {data-commentary-width=400}

First  we use the `adonis` function in vegan to run a PERMANOVA test. This will tell us whether environments have similar centroids or not. 

```{r ordination_stats_adonis}
set.seed(1911)
fish.jsd <- phyloseq::distance(ps_slv_work_filt, method = "jsd")
sampledf <- data.frame(sample_data(ps_slv_work_filt))
fish_adonis <- adonis(fish.jsd ~ ENV, data = sampledf, permutations = 1000)
fish_adonis
```

These results indicate that centroids are significantly different across environments meaning that communities are different by environments. 

We can also use the `pairwiseAdonis` package for pair-wise PERMANOVA analysis.

```{r pairwise_adonis}
pairwise.adonis(fish.jsd, factors = sampledf$ENV, p.adjust.m = "bonferroni")
```

Here we see  again we see that communities are different by environments. 

However, PERMANOVA assumes equal beta dispersion so we will  use the `betadisper` function from the `vegan` package to calculate beta dispersion values. 

```{r betadisper}
beta_adonis <- betadisper(fish.jsd, sampledf$ENV, bias.adjust = TRUE)
beta_adonis
```

And then a pair-wise Permutation test for homogeneity of multivariate dispersions using `permutest` (again from the `vegan` package). 

```{r permutest}
permutest(beta_adonis, pairwise=TRUE, permutations=1000)
```

These results are significant, meaning that  environments have different dispersions. Looking at the pairwise p-values and permuted p-value, we see that the  significant differences (p-value < 0.05) are between most environments. 

This means we are less confident that the PERMANOVA result is a real result, and that the result is possibly due to differences in group dispersions.

We can also use Analysis of Similarity (ANOSIM)---which does not assume equal group variances---to test whether overall microbial communities differ by environment.

```{r ordination_stats_anosim}
spgroup <- get_variable(ps_slv_work_filt, "ENV")
fish_anosim <- anosim(distance(ps_slv_work_filt, "jsd"), grouping = spgroup)
summary(fish_anosim)
```

And the AN0SIM result is significant meaning that environment influences microbial community composition. 

```{r simper, eval = FALSE, echo = FALSE, include = FALSE}
source("/Users/j/Desktop/FISH_MICROBIOME/02_PHYLOSEQ_WORKFLOW/HELPER_SCRIPTS/simper_pretty.R")
#Using the function
otutab <- as.table(otu_table(ps_slv_work_filt))

stuff <- simper.pretty(otu_table(ps_slv_work_filt), sample_data(ps_slv_work_filt), "Sp", perc_cutoff=0.5, low_cutoff = 'y', low_val=0.01, 'name')
```

***

To test whether microbial communities differ by environment we can use permutational analysis of variance (PERMANOVA) or analysis of similarity (ANOSIM). PERMANOVA does not assume normality but does assume equal beta dispersion between groups. We will test beta dispersion below.